Automatic standardisation of texts containing spelling variation

Author

  • Paul Rayson
Abstract

Large quantities of spelling variation in corpora, such as that found in Early Modern English, can cause significant problems for corpus linguistic tools and methods. Having texts with standardised spelling is key to making such tools and methods accurate and meaningful in their analysis. Gaining access to such versions of texts can be problematic, however, and manual standardisation of the texts is often too time-consuming to be feasible. Our solution is a piece of software named VARD 2, which can be used to manually and automatically standardise spelling variation in individual texts or in corpora of any size. This paper evaluates VARD 2’s performance on a corpus of Early Modern English letters and a corpus of children’s written English. The software’s ability to learn from manual standardisation is put under particular scrutiny as we examine what effect different levels of training have on its performance.

Similar resources

LGeRM: lemmatization of Middle French words

Unlike most modern languages, Middle French is a language whose spelling is not yet stabilized. There is a great deal of variation in the spelling of a word and accordingly the traditional methods for lemmatization cannot be used. LGeRM (lemmes, graphies et règles morphologiques) proposes a solution based on a databank containing known lemmatized spellings and a set of graphical and morphologic...

The Identification of Spelling Variants in English and German Historical Texts: Manual or Automatic?

Dawn ARCHER (University of Central Lancashire); Andrea ERNST-GERLACH, Sebastian KEMPKEN, Thomas PILZ (Universität Duisburg-Essen); Paul RAYSON (Lancaster University)

Alts, Abbreviations, and Akas: Historical Onomastic Variation and Automated Named Entity Recognition

The accurate automated identification of named places is a major concern for scholars in the digital humanities, and especially for those engaged in research that depends upon the gazetteer-led recognition of specific aspects. The field of onomastics examines the linguistic roots and historical development of names, which have for the most part only standardised into single officially recognise...

Automating Multi-Level Annotations of Orthographic Properties of German Words and Children’s Spelling Errors

This paper presents the automatic annotation of orthographic properties of German words and spelling errors in texts of German primary school children according to a new multi-layered annotation scheme [1]. The scheme is closely linked to the principles of the German writing system and is supposed to allow the pursuit of new research questions concerning the relationship between spelling errors...

Detecting spelling variants in non-standard texts

Spelling variation in non-standard language, e.g. computer-mediated communication and historical texts, is usually treated as a deviation from a standard spelling, e.g. 2mr as a non-standard spelling for tomorrow. Consequently, in normalization – the standard approach of dealing with spelling variation – so-called non-standard words are mapped to their corresponding standard words. However, the...

Publication date: 2009